An application of kNN to diagnose Diabetes
2025-04-14
The k-Nearest Neighbors (kNN) algorithm is used in a variety of fields to classify or predict data (Ali et al. 2020).
It is a simple algorithm that classifies a datapoint based on how similar it is to an existing class of datapoints (Zhang 2016).
One benefit of this model is its simplicity; it is also non-parametric, which means it fits a wide variety of datasets.
One drawback is its relatively high computational cost, which means it does not perform as well or as fast on big data (Deng et al. 2016).
In this project we focused on the methodology and application of classification kNN models in the field of healthcare to predict diabetes.
The kNN algorithm is a nonparametric supervised learning algorithm that can be used for classification or regression problems (Syriopoulos et al. 2023).
In classification, it labels a datapoint by finding the k nearest training points and assigning the class held by the majority of those neighbors.
Figure 1 illustrates this methodology with two distinct classes of hearts and circles.
The classification process has three distinct steps:

1. Compute the distance from the new datapoint to every labeled datapoint, typically using the Euclidean distance:

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]

2. Select the k labeled datapoints closest to the new point.
3. Assign the new point the majority class among those k neighbors.
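This classification process can be sketched in plain Python. The helper names below are illustrative, not the project's actual code:

```python
import math
from collections import Counter

def euclidean(a, b):
    # d = sqrt((X2 - X1)^2 + (Y2 - Y1)^2), generalized to any dimension
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_classify(train, labels, query, k=5):
    # Step 1: compute the distance from the query to every training point
    dists = [(euclidean(x, query), y) for x, y in zip(train, labels)]
    # Step 2: keep the k nearest neighbors
    nearest = sorted(dists, key=lambda t: t[0])[:k]
    # Step 3: majority vote among the neighbors' class labels
    return Counter(y for _, y in nearest).most_common(1)[0][0]
```

A query point near a cluster of one class is assigned that class, exactly as Figure 1 depicts with the hearts and circles.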
kNN allows the selection of a parameter k that controls how many neighbors the algorithm uses to classify the unknown datapoint. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value (Zhang 2016).
Once the k-nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have higher influence on the classification decision.
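Both ideas, choosing k by cross-validation and distance-weighted voting, can be sketched with scikit-learn. This is a minimal sketch on synthetic data, not the project's actual tuning code, and assumes scikit-learn is available:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the project's real features come from the CDC dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Pick k by 5-fold cross-validation over a few candidate values.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (3, 5, 9, 15, 21)}
best_k = max(scores, key=scores.get)

# Distance-weighted voting: closer neighbors count more, which also breaks ties.
model = KNeighborsClassifier(n_neighbors=best_k, weights="distance").fit(X, y)
```

With `weights="distance"`, a tied vote is resolved in favor of the closer neighbors rather than arbitrarily.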
The kNN algorithm assumes similar datapoints will be in close proximity to each other and be neighbors (Zhang 2016).
It also assumes that data points with similar features belong to the same class (Boateng, Otoo, and Abaye 2020).
To increase the accuracy of the model, there are a few parameters we can adjust.
We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. The data were gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), one of the largest ongoing health surveys in the United States.
Python and the ucimlrepo package were used to import the dataset directly from the UCI Machine Learning Repository, following the repository's recommended instructions. This made it easy to save, prepare, and analyze the data for this project.
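Following the repository's instructions, the import step might look like the sketch below. The dataset id of 891 is our reading of the UCI page and should be verified before running:

```python
# Requires: pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# 891 is the repository id listed for the CDC Diabetes Health Indicators
# dataset on the UCI page (worth double-checking before running).
cdc = fetch_ucirepo(id=891)

X = cdc.data.features   # 21 feature columns, e.g. HighBP, BMI, Age
y = cdc.data.targets    # Diabetes_binary: 0 = no diabetes, 1 = diabetes
print(X.shape, y.shape)
```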
The dataset consists of 253,680 survey responses and contains 21 feature variables and 1 binary target variable named Diabetes_binary.
Diabetes_binary: 0 = No Diabetes, 1 = Diabetes
Binary Variables: HighBP, HighChol, CholCheck, Smoker, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex.
Ordinal Variables: GenHlth, MentHlth, PhysHlth, Age, Education, Income
Continuous Variables: BMI
Figure 4 shows a graph of the mean of different features in the data.
Figure 5 shows outliers in the data that can skew our results.
Figure 6 shows the class imbalance present in the data.
A correlation heatmap was generated in Figure 7 to examine relationships between variables. The correlation heatmap helps identify strongly correlated features, which may lead to redundancy in the model.
There are no missing values, meaning no imputation is needed.
We have some duplicate values that need to be removed.
There is a class imbalance with the majority of cases not having diabetes.
We chose to create three classification kNN models to illustrate the methodology.
Table 4: Model Summary

| Model Name | k value | Weights | Distance | SMOTE |
|---|---|---|---|---|
| Model 1 | 5 | 'uniform' | Euclidean | No |
| Model 2 | 15 | 'uniform' | Euclidean | No |
| Model 3 | 15 | 'distance' | Euclidean | Yes |
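The three configurations in Table 4 could be instantiated in scikit-learn as follows. This is a sketch; the project's actual training code may differ, and Euclidean distance is scikit-learn's default metric:

```python
from sklearn.neighbors import KNeighborsClassifier

# The three configurations from Table 4.
model_1 = KNeighborsClassifier(n_neighbors=5,  weights="uniform")
model_2 = KNeighborsClassifier(n_neighbors=15, weights="uniform")
model_3 = KNeighborsClassifier(n_neighbors=15, weights="distance")

# Model 3 additionally resamples the training data before fitting, e.g. with
# imbalanced-learn's SMOTE (imblearn.over_sampling.SMOTE); that step is
# omitted here to keep the sketch dependency-free.
```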
The table below summarizes the performance of the three models.
| Model | k | Weights | SMOTE | Accuracy | F1 Score | Precision | Recall | ROC AUC |
|---|---|---|---|---|---|---|---|---|
| Model 1 | 5 | Uniform | No | 83.22% | 27.77% | 40.66% | 21.09% | 0.71 |
| Model 2 | 15 | Uniform | No | 84.56% | 22.38% | 48.37% | 14.56% | 0.77 |
| Model 3 | 15 | Distance | Yes | 67.77% | 39.77% | 27.84% | 69.58% | 0.74 |
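Each metric column can be reproduced with scikit-learn's metric functions. The toy labels below are hypothetical, chosen only to mimic a class-imbalanced setting like this one, where accuracy stays high even though most positives are missed:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical imbalanced labels: 8 negatives, 2 positives.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # misses one of two positives
y_score = [0.1, 0.2, 0.15, 0.3, 0.2, 0.5, 0.1, 0.3, 0.8, 0.4]

acc  = accuracy_score(y_true, y_pred)    # 0.9 despite missing half the positives
rec  = recall_score(y_true, y_pred)      # 0.5: fraction of actual positives caught
prec = precision_score(y_true, y_pred)   # 1.0: no false positives here
f1   = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc  = roc_auc_score(y_true, y_score)    # uses scores, not hard predictions
```

This is exactly the pattern in the table: on imbalanced data, accuracy rewards predicting the majority class, while recall exposes how many diabetic cases are actually found.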
Model 2 has the highest accuracy at 84.56%, but this score is inflated because the model is good at detecting non-diabetic cases, which make up the majority of the data.
It also has the highest ROC AUC score of 0.77, meaning it is the best model at separating the two classes; however, its recall is only 14.56%.
This means the model is correctly classifying just 14.56% of the actual positive cases for diabetes.
Model 3, with an accuracy of 67.77% and a much higher recall of 69.58%, is able to correctly identify about 70% of the positive diabetes cases.
kNN is a promising algorithmic model that can be further improved to detect diabetes.
In this project we created three kNN models trained to classify unknown datapoints into diabetes or non-diabetes classes using the CDC Diabetes Health Indicators dataset from the UC Irvine Machine Learning Repository.
We were able to see how fine-tuning a kNN model can help us detect diabetes in a healthcare setting.
Models 2 and 3 showed potential for classifying diabetic cases but would need to be further improved, for example by training on data containing more diabetic cases, before being used in a healthcare setting.